Abstract



In regular statistical models, the leave-one-out cross-validation is asymptotically
equivalent to the Akaike information criterion. However, since many learning machines
are singular statistical models, the asymptotic behavior of the cross-validation
remains unknown. In previous studies, we established the singular learning theory and
proposed a widely applicable information criterion, the expectation value of which is
asymptotically equal to the average Bayes generalization loss. In the present paper,
we theoretically compare the Bayes cross-validation loss and the widely applicable
information criterion and prove two theorems. First, the Bayes cross-validation loss
is asymptotically equivalent to the widely applicable information criterion as a
random variable. Therefore, model selection and hyperparameter optimization using
these two values are asymptotically equivalent. Second, the sum of the Bayes
generalization error and the Bayes cross-validation error is asymptotically equal to
$2\lambda/n$, where $\lambda$ is the real log canonical threshold and $n$ is the
number of training samples. Therefore the relation between the cross-validation error
and the generalization error is determined by the algebraic geometrical structure of a
learning machine. We also clarify that the deviance information criteria are different
from the Bayes cross-validation and the widely applicable information criterion.

Keywords: Cross-validation, Information Criterion, Singular Learning Ma- 
chine, Birational Invariant 
1 Introduction



A statistical model or a learning machine is said to be regular if the map taking
parameters to probability distributions is one-to-one and if its Fisher information
matrix is positive definite. If a model is not regular, then it is said to be singular.
Many learning machines, such as artificial neural networks [Watanabe 01b], normal
mixtures [Yamazaki & Watanabe 03], reduced rank regressions [Aoyagi & Watanabe 05],
Bayes networks [Rusakov & Geiger 05] [Zwiernik 10], mixtures of probability
distributions [Lin 10], Boltzmann machines [Aoyagi 10], and hidden Markov models
[Yamazaki & Watanabe 05], are not regular but singular [Watanabe 07]. If a statistical
model or a learning machine contains a hierarchical structure, hidden variables, or a
grammatical rule, then the model is generally singular. Therefore, singular learning
theory is necessary in modern information science.

The statistical properties of singular models have remained unknown until recently,
because analyzing a singular likelihood function had been difficult [Hartigan 85]
[Watanabe 95]. In singular statistical models, the maximum likelihood estimator does
not satisfy asymptotic normality. Consequently, AIC is not equal to the average
generalization error [Hagiwara 02], and the Bayes information criterion (BIC) is not
equal to the Bayes marginal likelihood [Watanabe 01a], even asymptotically. In singular
models, the maximum likelihood estimator often diverges, or even if it does not
diverge, makes the generalization error very large. Therefore, the maximum likelihood
method is not appropriate for singular models. On the other hand, Bayes estimation was
proven to make the generalization error smaller if the statistical model contains
singularities. Therefore, in the present paper, we investigate methods for estimating
the Bayes generalization error.

Recently, a new statistical learning theory, based on methods from algebraic geometry,
has been established [Watanabe 01a] [Drton et al. 09] [Watanabe 09] [Watanabe 10a]
[Watanabe 10c] [Lin 10]. In singular learning theory, a log likelihood function can be
made into a common standard form, even if it contains singularities, by using the
resolution theorem in algebraic geometry. As a result, the asymptotic behavior of the
posterior distribution is clarified, and the concepts of BIC and AIC can be generalized
to singular statistical models. The asymptotic Bayes marginal likelihood was proven to
be determined by the real log canonical threshold [Watanabe 01a], and the average Bayes
generalization error was proven to be estimable by the widely applicable information
criterion [Watanabe 09] [Watanabe 10a] [Watanabe 10c].

Cross-validation is an alternative method for estimating the generalization error
[Mosier 51] [Stone 77] [Geisser 75]. By definition, the average of the cross-validation
is equal to the average generalization error in both regular and singular models. In
regular statistical models, the leave-one-out cross-validation is asymptotically
equivalent to AIC [Akaike 74] in the maximum likelihood method [Stone 77] [Linhart 86]
[Browne 00]. However, the asymptotic behavior of the cross-validation in singular
models has not been clarified.

In the present paper, in singular statistical models, we theoretically compare the
Bayes cross-validation, the widely applicable information criterion, and the Bayes
generalization error and prove two theorems. First, we show that the Bayes
cross-validation loss is asymptotically equivalent to the widely applicable information
criterion as a random variable. Second, we also show that the sum of the Bayes
cross-validation error and the Bayes generalization error is asymptotically equal to
$2\lambda/n$, where $\lambda$ is the real log canonical threshold and $n$ is the number
of training samples. It is important that neither $\lambda$ nor $n$ is a random
variable. Since the real log canonical threshold is a birational invariant of the
statistical model, the relationship between the Bayes cross-validation and the Bayes
generalization error is determined by the algebraic geometrical structure of the
statistical model.

Table 1: Variables, Names, and Equation Numbers

The remainder of the present paper is organized as follows. In Section 2, we 
introduce the framework of Bayes learning and explain singular learning theory. In 
Section 3, the Bayes cross-validation is defined. In Section 4, the main theorems 
are proven. In Section 5, we discuss the results of the present paper, and the differ- 
ences among the cross-validation, the widely applicable information criterion, and 
the deviance information criterion are investigated theoretically and experimentally. 
Finally, in Section 6, we summarize the primary conclusions of the present paper. 

2 Bayes Learning Theory 

In this section, we summarize Bayes learning theory for singular learning machines.
The results presented in this section are well known and are the fundamental basis
of the present paper. Table 1 lists variables, names, and equation numbers in the
present paper.

2.1 Framework of Bayes Learning 

First, we explain the framework of Bayes learning. 

Let $q(x)$ be a probability density function on the $N$-dimensional real Euclidean
space $\mathbb{R}^N$. The training samples and the testing sample are denoted by random
variables $X_1, X_2, \ldots, X_n$ and $X$, respectively, which are independently subject
to the same probability distribution $q(x)dx$. The probability distribution $q(x)dx$ is
sometimes called the true distribution.

A statistical model or a learning machine is defined as a probability density function
$p(x|w)$ of $x \in \mathbb{R}^N$ for a given parameter $w \in W \subset \mathbb{R}^d$,
where $W$ is the set of all parameters. In Bayes estimation, we prepare a probability
density function $\varphi(w)$ on $W$. Although $\varphi(w)$ is referred to as a prior
distribution, in general, $\varphi(w)$ does not necessarily represent a priori
knowledge of the parameter.

For a given function $f(w)$ on $W$, the expectation value of $f(w)$ with respect to
the posterior distribution is defined as

$$ E_w[f(w)] = \frac{\int f(w) \prod_{i=1}^{n} p(X_i|w)^{\beta}\, \varphi(w)\, dw}{\int \prod_{i=1}^{n} p(X_i|w)^{\beta}\, \varphi(w)\, dw}, \tag{1} $$

where $0 < \beta < \infty$ is the inverse temperature. The case in which $\beta = 1$ is
most important because this case corresponds to strict Bayes estimation. The Bayes
predictive distribution is defined as

$$ p^*(x) = E_w[p(x|w)]. \tag{2} $$

In Bayes learning theory, the following random variables are important. The Bayes
generalization loss $B_gL(n)$ and the Bayes training loss $B_tL(n)$ are defined,
respectively, as

$$ B_gL(n) = -E_X[\log p^*(X)], \tag{3} $$

$$ B_tL(n) = -\frac{1}{n} \sum_{i=1}^{n} \log p^*(X_i), \tag{4} $$

where $E_X[\ ]$ gives the expectation value over $X$. The functional variance is defined
as

$$ V(n) = \sum_{i=1}^{n} \left\{ E_w[(\log p(X_i|w))^2] - E_w[\log p(X_i|w)]^2 \right\}, \tag{5} $$

which shows the fluctuation of the posterior distribution. In previous papers
[Watanabe 09] [Watanabe 10a] [Watanabe 10b], we defined the widely applicable
information criterion

$$ \mathrm{WAIC}(n) = B_tL(n) + \frac{\beta}{n} V(n), \tag{6} $$

and proved that

$$ E[B_gL(n)] = E[\mathrm{WAIC}(n)] + o\!\left(\frac{1}{n}\right) \tag{7} $$

holds for both regular and singular statistical models, where $E[\ ]$ gives the
expectation value over the sets of training samples.
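
As an illustration of eqs. (4) through (6), WAIC can be estimated from posterior draws.
The following is a minimal sketch, not the procedure used in the experiments of the
present paper. It assumes that a matrix `loglik[s, i]` holding $\log p(X_i|w_s)$ has
already been computed for parameters $w_s$ drawn from the posterior at inverse
temperature $\beta$ (for example by MCMC), so that posterior averages $E_w[\ ]$ are
replaced by means over draws. The function names `log_mean_exp` and `waic` are
hypothetical.

```python
import numpy as np


def log_mean_exp(a, axis=0):
    """Numerically stable log of the mean of exp(a) along `axis`."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))


def waic(loglik, beta=1.0):
    """Estimate WAIC(n) from loglik[s, i] = log p(X_i | w_s).

    Rows index posterior draws (assumed to come from the posterior at inverse
    temperature `beta`); columns index the n training samples.
    """
    n = loglik.shape[1]
    # Bayes training loss, eq. (4): -(1/n) sum_i log E_w[p(X_i|w)]
    bayes_training_loss = -np.mean(log_mean_exp(loglik, axis=0))
    # Functional variance, eq. (5): sum_i Var_w[log p(X_i|w)]
    functional_variance = np.sum(np.var(loglik, axis=0))
    # WAIC, eq. (6)
    return bayes_training_loss + (beta / n) * functional_variance
```

Because the result depends only on the training samples, it can be compared across
candidate models or hyperparameters without access to the true distribution.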

Remark. Although the case in which $\beta = 1$ is most important, general cases in
which $0 < \beta < \infty$ are also important for four reasons. First, from a
theoretical viewpoint, several mathematical relations can be obtained using the
derivative with respect to $\beta$. For example, using the Bayes free energy or the
Bayes stochastic complexity,

$$ \mathcal{F}(\beta) = -\log \int \prod_{i=1}^{n} p(X_i|w)^{\beta}\, \varphi(w)\, dw, \tag{8} $$

the Gibbs training loss

$$ G_tL(n) = -E_w\!\left[ \frac{1}{n} \sum_{i=1}^{n} \log p(X_i|w) \right] \tag{9} $$

can be written as

$$ G_tL(n) = \frac{1}{n} \frac{\partial \mathcal{F}}{\partial \beta}. \tag{10} $$

Such relations are useful in investigating Bayes learning theory. We use
$\partial^2 \mathcal{F} / \partial \beta^2$ to investigate the deviance information
criteria in Section 5. Second, the maximum likelihood method formally corresponds to
$\beta = \infty$. The maximum likelihood method is defined as

$$ p^*(x) = p(x|\hat{w}), \tag{11} $$

instead of eq. (2), where $\hat{w}$ is the maximum likelihood estimator. Its
generalization loss is also defined in the same manner as eq. (3). In regular
statistical models, the asymptotic Bayes generalization error does not depend on
$0 < \beta \le \infty$, whereas in singular models it strongly depends on $\beta$.
Therefore, the general case is useful for investigating the difference between the
maximum likelihood and Bayes methods. Third, from an experimental viewpoint, in order
to approximate the posterior distribution, the Markov chain Monte Carlo method is often
applied by controlling $\beta$. In particular, the identity

$$ \mathcal{F}(1) = \int_0^1 \frac{\partial \mathcal{F}}{\partial \beta}\, d\beta \tag{12} $$

is used in the calculation of the Bayes marginal likelihood. The theoretical results for
general $\beta$ are useful for monitoring the effect of controlling $\beta$
[Nagata & Watanabe 08]. Finally, in the regression problem, $\beta$ can be understood as
the variance of the unknown additional noise [Watanabe 10c] and so may be optimized as
the hyperparameter. For these reasons, in the present paper, we theoretically
investigate the cases for general $\beta$.
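
The relation in eq. (10) between the Gibbs training loss and the free energy follows by
differentiating eq. (8) with respect to the inverse temperature. The short derivation
below is a sketch that assumes differentiation and integration can be exchanged; the
symbol $Z(\beta)$ is a shorthand introduced here for the integral in eq. (8) and does
not appear in the original text.

```latex
\begin{align*}
Z(\beta) &= \int \prod_{i=1}^{n} p(X_i|w)^{\beta}\,\varphi(w)\,dw,
\qquad \mathcal{F}(\beta) = -\log Z(\beta), \\
\frac{\partial \mathcal{F}}{\partial \beta}
&= -\frac{1}{Z(\beta)} \int \Bigl(\sum_{i=1}^{n} \log p(X_i|w)\Bigr)
   \prod_{i=1}^{n} p(X_i|w)^{\beta}\,\varphi(w)\,dw \\
&= -E_w\Bigl[\sum_{i=1}^{n} \log p(X_i|w)\Bigr] = n\,G_tL(n).
\end{align*}
```

Since $\varphi(w)$ is a probability density, $Z(0) = 1$ and $\mathcal{F}(0) = 0$, so
integrating this derivative over $0 \le \beta \le 1$ recovers $\mathcal{F}(1)$, which is
why controlling $\beta$ is useful for computing the Bayes marginal likelihood.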



2.2 Notation 

In the following, we explain the notation used in the present study. 

The log loss function $L(w)$ and the entropy $S$ of the true distribution are defined,
respectively, as

$$ L(w) = -E_X[\log p(X|w)], \tag{13} $$

$$ S = -E_X[\log q(X)]. \tag{14} $$

Then, $L(w) = S + D(q\|p_w)$, where $D(q\|p_w)$ is the Kullback-Leibler distance
defined as

$$ D(q\|p_w) = \int q(x) \log \frac{q(x)}{p(x|w)}\, dx. \tag{15} $$

Then, $D(q\|p_w) \ge 0$, hence $L(w) \ge S$. Moreover, $L(w) = S$ if and only if
$p(x|w) = q(x)$.

In the present paper, we assume that there exists a parameter $w_0 \in W$ that
minimizes $L(w)$,

$$ L(w_0) = \min_{w \in W} L(w). \tag{16} $$

Note that such $w_0$ is not unique in general because the map $w \mapsto p(x|w)$ is, in
general, not a one-to-one map in singular learning machines. In addition, we assume
that, for an arbitrary $w$ that satisfies $L(w) = L(w_0)$, $p(x|w)$ is the same
probability density function. Let $p_0(x)$ be such a unique probability density
function. In general, the set

$$ W_0 = \{ w \in W ;\ p(x|w) = p_0(x) \} \tag{17} $$

is not a set of a single element but rather an analytic or algebraic set with
singularities. Here, a set in $\mathbb{R}^d$ is said to be an analytic or algebraic set
if and only if the set is equal to the set of all zero points of an analytic or
algebraic function, respectively. For simple notation, the minimum log loss $L_0$ and
the empirical log loss $L_n$ are defined, respectively, as

$$ L_0 = -E_X[\log p_0(X)], \tag{18} $$

$$ L_n = -\frac{1}{n} \sum_{i=1}^{n} \log p_0(X_i). \tag{19} $$

Then, by definition, $L_0 = E[L_n]$. Using these values, the Bayes generalization error
$B_g(n)$ and the Bayes training error $B_t(n)$ are defined, respectively, as

$$ B_g(n) = B_gL(n) - L_0, \tag{20} $$

$$ B_t(n) = B_tL(n) - L_n. \tag{21} $$

Let us define a log density ratio function as

$$ f(x, w) = \log \frac{p_0(x)}{p(x|w)}, \tag{22} $$

which is equivalent to

$$ p(x|w) = p_0(x) \exp(-f(x, w)). \tag{23} $$

Then, it immediately follows that

$$ B_g(n) = -E_X[\log E_w[\exp(-f(X, w))]], \tag{24} $$

$$ B_t(n) = -\frac{1}{n} \sum_{i=1}^{n} \log E_w[\exp(-f(X_i, w))], \tag{25} $$

$$ V(n) = \sum_{i=1}^{n} \left\{ E_w[f(X_i, w)^2] - E_w[f(X_i, w)]^2 \right\}. \tag{26} $$

Therefore, the problem of statistical learning is characterized by the function
$f(x, w)$.
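
The step from the definitions to eq. (24) can be spelled out as follows. This is a
sketch that only substitutes eq. (23) into the predictive distribution of eq. (2) and
uses the definition of the minimum log loss in eq. (18); the same substitution applied
to the training samples gives eq. (25).

```latex
\begin{align*}
p^*(x) &= E_w[p(x|w)] = p_0(x)\, E_w[\exp(-f(x,w))], \\
B_g(n) &= -E_X[\log p^*(X)] - L_0 \\
       &= -E_X[\log p_0(X)] - E_X[\log E_w[\exp(-f(X,w))]] + E_X[\log p_0(X)] \\
       &= -E_X[\log E_w[\exp(-f(X,w))]].
\end{align*}
```

For eq. (26), note that $\log p(X_i|w) = \log p_0(X_i) - f(X_i, w)$ and that a posterior
variance is unchanged both by adding a term that does not depend on $w$ and by a change
of sign.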
Definition.

(1) If $q(x) = p_0(x)$, then $q(x)$ is said to be realizable by $p(x|w)$. Otherwise,
$q(x)$ is said to be unrealizable.

(2) If the set $W_0$ consists of a single point $w_0$ and if the Hessian matrix
$\nabla\nabla L(w_0)$ is strictly positive definite, then $q(x)$ is said to be regular
for $p(x|w)$. Otherwise, $q(x)$ is said to be singular for $p(x|w)$.

Bayes learning theory was investigated for a realizable and regular case [Schwarz 78]
[Levin et al. 90] [Amari 93]. The WAIC was found for a realizable and singular case
[Watanabe 01a] [Watanabe 09] [Watanabe 10a] and for an unrealizable and regular case
[Watanabe 10b]. In addition, WAIC was generalized for an unrealizable and singular case
[Watanabe 10d].



2.3 Singular Learning Theory 

We summarize singular learning theory. In the present paper, we assume the following.

Assumptions.

(1) The set of parameters $W$ is a compact set in $\mathbb{R}^d$, the open kernel$^1$
of which is not the empty set. The boundary of $W$ is defined by several analytic
functions,

$$ W = \{ w \in \mathbb{R}^d ;\ \pi_1(w) \ge 0,\ \pi_2(w) \ge 0,\ \ldots,\ \pi_k(w) \ge 0 \}. \tag{27} $$

(2) The prior distribution satisfies $\varphi(w) = \varphi_1(w) \varphi_2(w)$, where
$\varphi_1(w) \ge 0$ is an analytic function and $\varphi_2(w) > 0$ is a
$C^\infty$-class function.

(3) Let $s \ge 8$ and let

$$ L^s(q) = \left\{ f(x) ;\ \|f\| \equiv \left( \int |f(x)|^s q(x)\, dx \right)^{1/s} < \infty \right\} \tag{28} $$

be a Banach space. The map $W \ni w \mapsto f(x, w)$ is an $L^s(q)$-valued analytic
function.

(4) A nonnegative function $K(w)$ is defined as

$$ K(w) = E_X[f(X, w)]. \tag{29} $$

The set $W_\epsilon$ is defined as

$$ W_\epsilon = \{ w \in W ;\ K(w) \le \epsilon \}. \tag{30} $$

It is assumed that there exist constants $\epsilon, c > 0$ such that

$$ (\forall w \in W_\epsilon) \quad E_X[f(X, w)] \ge c\, E_X[f(X, w)^2]. \tag{31} $$

$^1$ The open kernel of a set $A$ is the largest open set that is contained in $A$.

Remark. In ordinary learning problems, if the true distribution is regular for or
realizable by a learning machine, then assumptions (1), (2), (3), and (4) are
satisfied, and the results of the present paper hold. If the true distribution is
singular for and unrealizable by a learning machine, then assumption (4) is satisfied
in some cases but not in other cases. If assumption (4) is not satisfied, then the
Bayes generalization and training errors may have asymptotic behaviors other than those
described in Lemma 1 [Watanabe 10d].



The investigation of cross-validation in singular learning machines requires singular
learning theory. In previous papers, we obtained the following lemma.

Lemma 1. Assume that assumptions (1), (2), (3), and (4) are satisfied. Then, the
following hold.

(1) Three random variables $nB_g(n)$, $nB_t(n)$, and $V(n)$ converge in law, when $n$
tends to infinity. In addition, the expectation values of these variables converge.

(2) For $k = 1, 2, 3, 4$, we define

$$ M_k(n) \equiv \sup_{|\alpha| \le 1 + \beta} E\left[ \frac{1}{n} \sum_{i=1}^{n} \frac{E_w[\, |f(X_i, w)|^k \exp(\alpha f(X_i, w)) \,]}{E_w[\exp(\alpha f(X_i, w))]} \right], \tag{32} $$

where $E[\ ]$ gives the average over all sets of training samples. Then,

$$ \limsup_{n \to \infty} \left( n^{k/2} M_k(n) \right) < \infty. \tag{33} $$

(3) The expectation value of the Bayes generalization loss is asymptotically equal to
the widely applicable information criterion,

$$ E[B_gL(n)] = E[\mathrm{WAIC}(n)] + o\!\left(\frac{1}{n}\right). \tag{34} $$

(Proof) For the case in which $q(x)$ is realizable by and singular for $p(x|w)$, this
lemma was proven in [Watanabe 10a] [Watanabe 09]. In fact, the proof of Lemma 1 (1) is
given in Theorem 1 of [Watanabe 10a]. Also, Lemma 1 (2) can be proven in the same
manner as eq. (32) in [Watanabe 10a] or eq. (6.59) in [Watanabe 09]. The proof of
Lemma 1 (3) is given in Theorem 2 and the discussion of [Watanabe 10a]. For the case in
which $q(x)$ is regular for and unrealizable by $p(x|w)$, this lemma was proven in
[Watanabe 10b]. For the case in which $q(x)$ is singular for and unrealizable by
$p(x|w)$, these results can be generalized under the condition that eq. (31) is
satisfied [Watanabe 10d]. (Q.E.D.)

3 Bayes Cross-validation 

In this section, we introduce the cross-validation in Bayes learning. 

The expectation value $E_w^{(i)}[\ ]$ using the posterior distribution leaving out
$X_i$ is defined as

$$ E_w^{(i)}[\ \cdot\ ] = \frac{\int (\ \cdot\ ) \prod_{j \ne i}^{n} p(X_j|w)^{\beta}\, \varphi(w)\, dw}{\int \prod_{j \ne i}^{n} p(X_j|w)^{\beta}\, \varphi(w)\, dw}, \tag{35} $$

where $\prod_{j \ne i}^{n}$ shows the product for $j = 1, 2, 3, \ldots, n$, which does
not include $j = i$. The predictive distribution leaving out $X_i$ is defined as

$$ p^{(i)}(x) = E_w^{(i)}[p(x|w)]. \tag{36} $$

The log loss of $p^{(i)}(x)$ when $X_i$ is used as a testing sample is

$$ -\log p^{(i)}(X_i) = -\log E_w^{(i)}[p(X_i|w)]. \tag{37} $$

Thus, the log loss of the Bayes cross-validation is defined as the empirical average
of them,

$$ C_vL(n) = -\frac{1}{n} \sum_{i=1}^{n} \log E_w^{(i)}[p(X_i|w)]. \tag{38} $$

The random variable $C_vL(n)$ is referred to as the cross-validation loss. Since
$X_1, X_2, \ldots, X_n$ are independent training samples, it immediately follows that

$$ E[C_vL(n)] = E[B_gL(n-1)]. \tag{39} $$
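
To make definition (38) concrete, the following minimal sketch evaluates the
cross-validation loss in closed form for a regular toy model that is not discussed in
the present paper: a normal likelihood with unknown mean $w$ and known variance, a
normal prior on $w$, and $\beta = 1$. In this conjugate case each leave-one-out
posterior and each leave-one-out predictive distribution is again normal, so no
sampling is needed. The function name `bayes_loo_cv_normal` and its parameters are
hypothetical.

```python
import numpy as np


def bayes_loo_cv_normal(x, prior_mean=0.0, prior_var=1.0, noise_var=1.0):
    """Cross-validation loss of eq. (38) at beta = 1 for a conjugate normal model.

    Assumed model (illustration only): p(x|w) = N(x; w, noise_var) with prior
    N(w; prior_mean, prior_var).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    # Posterior precision and mean of w given the n - 1 samples other than X_i.
    precision = 1.0 / prior_var + (n - 1) / noise_var
    loo_mean = (prior_mean / prior_var + (x.sum() - x) / noise_var) / precision
    # The leave-one-out predictive density is N(x; loo_mean[i], pred_var).
    pred_var = noise_var + 1.0 / precision
    log_pred = (-0.5 * np.log(2.0 * np.pi * pred_var)
                - 0.5 * (x - loo_mean) ** 2 / pred_var)
    # Empirical average of the leave-one-out log losses.
    return -np.mean(log_pred)
```

For models without conjugacy the $n$ leave-one-out posteriors have to be approximated
numerically, which is the computational cost discussed in Section 5.2.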

Although the two random variables $C_vL(n)$ and $B_gL(n-1)$ are different,

$$ C_vL(n) \ne B_gL(n-1), \tag{40} $$

their expectation values coincide with each other by definition. Using eq. (34), it
follows that

$$ E[C_vL(n)] = E[\mathrm{WAIC}(n-1)] + o\!\left(\frac{1}{n}\right). \tag{41} $$

Therefore, three expectation values $E[C_vL(n)]$, $E[B_gL(n-1)]$, and
$E[\mathrm{WAIC}(n-1)]$ are asymptotically equal to each other. The primary goal of the
present paper is to clarify the asymptotic behaviors of three random variables,
$C_vL(n)$, $B_gL(n)$, and $\mathrm{WAIC}(n)$, when $n$ is sufficiently large.

Remark. In practical applications, the Bayes generalization loss $B_gL(n)$ indicates
the accuracy of Bayes estimation. However, in order to calculate $B_gL(n)$, we need
the expectation value over the testing sample taken from the unknown true distribution,
hence we cannot directly obtain $B_gL(n)$ in practical applications. On the other hand,
both the cross-validation loss $C_vL(n)$ and the widely applicable information
criterion $\mathrm{WAIC}(n)$ can be calculated using only training samples. Therefore,
the cross-validation loss and the widely applicable information criterion can be used
for model selection and hyperparameter optimization. This is the reason why comparison
of these random variables is an important problem in statistical learning theory.

4 Main Results

In this section, the main results of the present paper are explained. First, we define
functional cumulants and describe their asymptotic properties. Second, we prove that
both the cross-validation loss and the widely applicable information criterion can be
represented by the functional cumulants. Finally, we prove that the cross-validation
loss and the widely applicable information criterion are related to the birational
invariants.

4.1 Functional Cumulants

Definition. The generating function $F(\alpha)$ of functional cumulants is defined as

$$ F(\alpha) = \frac{1}{n} \sum_{i=1}^{n} \log E_w[p(X_i|w)^{\alpha}]. \tag{42} $$

The $k$th order functional cumulant $Y_k(n)$ $(k = 1, 2, 3, 4)$ is defined as

$$ Y_k(n) = \frac{d^k F}{d\alpha^k}(0). \tag{43} $$
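Then, by definition,

$$ F(0) = 0, \tag{44} $$

$$ F(1) = -B_tL(n), \tag{45} $$

$$ Y_1(n) = -G_tL(n), \tag{46} $$

$$ Y_2(n) = V(n)/n. \tag{47} $$

For simple notation, we use

$$ \ell_k(X_i) = E_w[(\log p(X_i|w))^k] \quad (k = 1, 2, 3, 4). \tag{48} $$

Lemma 2. The following hold:

$$ Y_1(n) = \frac{1}{n} \sum_{i=1}^{n} \ell_1(X_i), \tag{49} $$

$$ Y_2(n) = \frac{1}{n} \sum_{i=1}^{n} \left\{ \ell_2(X_i) - \ell_1(X_i)^2 \right\}, \tag{50} $$

$$ Y_3(n) = \frac{1}{n} \sum_{i=1}^{n} \left\{ \ell_3(X_i) - 3 \ell_2(X_i) \ell_1(X_i) + 2 \ell_1(X_i)^3 \right\}, \tag{51} $$

$$ Y_4(n) = \frac{1}{n} \sum_{i=1}^{n} \left\{ \ell_4(X_i) - 4 \ell_3(X_i) \ell_1(X_i) - 3 \ell_2(X_i)^2 + 12 \ell_2(X_i) \ell_1(X_i)^2 - 6 \ell_1(X_i)^4 \right\}. \tag{52} $$

Moreover,

$$ Y_k(n) = O_p\!\left( \frac{1}{n^{k/2}} \right) \quad (k = 2, 3, 4). \tag{53} $$

In other words,

$$ \limsup_{n \to \infty} E[\, n^{k/2} |Y_k(n)| \,] < \infty \quad (k = 2, 3, 4). \tag{54} $$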
(Proof) First, we prove eqs. (49) through (52). Let us define

$$ g_i(\alpha) = E_w[p(X_i|w)^{\alpha}]. \tag{55} $$

Then, $g_i(0) = 1$,

$$ g_i^{(k)}(0) \equiv \frac{d^k g_i}{d\alpha^k}(0) = \ell_k(X_i) \quad (k = 1, 2, 3, 4), \tag{56} $$

and

$$ F(\alpha) = \frac{1}{n} \sum_{i=1}^{n} \log g_i(\alpha). \tag{57} $$

For arbitrary natural number $k$,

$$ \left( \frac{g_i^{(k)}(\alpha)}{g_i(\alpha)} \right)' = \frac{g_i^{(k+1)}(\alpha)}{g_i(\alpha)} - \left( \frac{g_i^{(k)}(\alpha)}{g_i(\alpha)} \right) \left( \frac{g_i'(\alpha)}{g_i(\alpha)} \right). \tag{58} $$

By applying this relation recursively, eqs. (49), (50), (51), and (52) are derived. Let
us prove eq. (54). The random variables $Y_k(n)$ $(k = 2, 3, 4)$ are invariant under
the transform,

$$ \log p(X_i|w) \mapsto \log p(X_i|w) + c(X_i), \tag{59} $$

for arbitrary $c(X_i)$. In fact, by replacing $p(X_i|w)$ by $p(X_i|w) e^{c(X_i)}$, we
define

$$ \hat{F}(\alpha) = \frac{1}{n} \sum_{i=1}^{n} \log E_w[p(X_i|w)^{\alpha} e^{\alpha c(X_i)}]. \tag{60} $$

Then, the difference between $F(\alpha)$ and $\hat{F}(\alpha)$ is a linear function of
$\alpha$, which vanishes by higher-order differentiation. In particular, by selecting
$c(X_i) = -\log p_0(X_i)$, we can show that $Y_k(n)$ $(k = 2, 3, 4)$ are invariant by
the following replacement,

$$ \log p(X_i|w) \mapsto -f(X_i, w). \tag{61} $$

In other words, $Y_k(n)$ $(k = 2, 3, 4)$ are invariant by the replacement,

$$ \ell_k(X_i) \mapsto (-1)^k E_w[f(X_i, w)^k]. \tag{62} $$

Using Hölder's inequality, for $1 \le k' \le k$,

$$ E_w[|f(X_i, w)|^{k'}]^{1/k'} \le E_w[|f(X_i, w)|^{k}]^{1/k}. \tag{63} $$

Therefore, for $k = 2, 3, 4$,

$$ E[|Y_k(n)|] \le E\left[ \frac{C_k}{n} \sum_{i=1}^{n} E_w[|f(X_i, w)|^k] \right] \le C_k M_k(n), \tag{64} $$

where $C_2 = 2$, $C_3 = 6$, $C_4 = 26$. Then, using eq. (33), we obtain eq. (54).
(Q.E.D.)

Remark. Using eq. (59) with $c(X_i) = -E_w[\log p(X_i|w)]$ and the normalized function
defined as

$$ \ell_k^*(X_i) = E_w[(\log p(X_i|w) + c(X_i))^k], \tag{65} $$

it follows that

$$ Y_2(n) = \frac{1}{n} \sum_{i=1}^{n} \ell_2^*(X_i), \tag{66} $$

$$ Y_3(n) = \frac{1}{n} \sum_{i=1}^{n} \ell_3^*(X_i), \tag{67} $$

$$ Y_4(n) = \frac{1}{n} \sum_{i=1}^{n} \left\{ \ell_4^*(X_i) - 3 \ell_2^*(X_i)^2 \right\}. \tag{68} $$

These formulas may be useful in practical applications.
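
The formulas of Lemma 2 and of the remark above translate directly into sample
estimates. The sketch below assumes, for illustration only, a matrix `loglik[s, i]`
holding $\log p(X_i|w_s)$ evaluated at posterior draws $w_s$, and replaces $E_w[\ ]$ by
the mean over draws; both function names are hypothetical. The centered version
corresponds to eqs. (66) through (68) and avoids the cancellation between raw moments
that appears in eqs. (50) through (52).

```python
import numpy as np


def functional_cumulants(loglik):
    """Return (Y1, Y2, Y3, Y4) from raw posterior moments, eqs. (49)-(52)."""
    # l_k[i] estimates E_w[(log p(X_i|w))^k] by the mean over draws.
    l1, l2, l3, l4 = (np.mean(loglik ** k, axis=0) for k in (1, 2, 3, 4))
    y1 = np.mean(l1)
    y2 = np.mean(l2 - l1 ** 2)
    y3 = np.mean(l3 - 3 * l2 * l1 + 2 * l1 ** 3)
    y4 = np.mean(l4 - 4 * l3 * l1 - 3 * l2 ** 2 + 12 * l2 * l1 ** 2 - 6 * l1 ** 4)
    return y1, y2, y3, y4


def centered_functional_cumulants(loglik):
    """Return (Y2, Y3, Y4) from centered posterior moments, eqs. (66)-(68)."""
    # Subtract the posterior mean of log p(X_i|w) for each sample X_i.
    centered = loglik - loglik.mean(axis=0)
    m2, m3, m4 = (np.mean(centered ** k, axis=0) for k in (2, 3, 4))
    return np.mean(m2), np.mean(m3), np.mean(m4 - 3 * m2 ** 2)
```

Multiplying the second cumulant by $n$ gives the functional variance $V(n)$ used in
WAIC.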
4.2 Bayes Cross-validation and Widely Applicable Information Criterion

We show the asymptotic equivalence of the cross-validation loss $C_vL(n)$ and the
widely applicable information criterion $\mathrm{WAIC}(n)$.

Theorem 1. For arbitrary $0 < \beta < \infty$, the cross-validation loss $C_vL(n)$ and
the widely applicable information criterion $\mathrm{WAIC}(n)$ are given, respectively,
as

$$ C_vL(n) = -Y_1(n) + \left( \frac{2\beta - 1}{2} \right) Y_2(n) - \left( \frac{3\beta^2 - 3\beta + 1}{6} \right) Y_3(n) + O_p\!\left( \frac{1}{n^2} \right), \tag{69} $$

$$ \mathrm{WAIC}(n) = -Y_1(n) + \left( \frac{2\beta - 1}{2} \right) Y_2(n) - \frac{1}{6} Y_3(n) + O_p\!\left( \frac{1}{n^2} \right). \tag{70} $$

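(Proof) First, we consider $C_vL(n)$. From the definitions of $E_w[\ ]$ and
$E_w^{(i)}[\ ]$, we have

$$ E_w^{(i)}[(\ \cdot\ )] = \frac{E_w[(\ \cdot\ )\, p(X_i|w)^{-\beta}]}{E_w[p(X_i|w)^{-\beta}]}. \tag{71} $$

Therefore, by the definition of the cross-validation loss, eq. (38),

$$ C_vL(n) = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{E_w[p(X_i|w)^{1-\beta}]}{E_w[p(X_i|w)^{-\beta}]}. \tag{72} $$

Using the generating function of functional cumulants $F(\alpha)$,

$$ C_vL(n) = F(-\beta) - F(1-\beta). \tag{73} $$

Then, using Lemma 1 (2) for each $k = 2, 3, 4$, and $|\alpha| < 1 + \beta$,

$$ E[|F^{(k)}(\alpha)|] \le E\left[ \frac{C_k}{n} \sum_{i=1}^{n} \frac{E_w[|f(X_i, w)|^k \exp(-\alpha f(X_i, w))]}{E_w[\exp(-\alpha f(X_i, w))]} \right] \le C_k M_k(n), \tag{74} $$

where $C_2 = 2$, $C_3 = 6$, $C_4 = 26$. Therefore,

$$ |F^{(k)}(\alpha)| = O_p\!\left( \frac{1}{n^{k/2}} \right). \tag{75} $$

By Taylor expansion of $F(\alpha)$ around $\alpha = 0$, there exist $\beta^*$,
$\beta^{**}$ $(|\beta^*|, |\beta^{**}| < 1 + \beta)$ such that

$$ F(-\beta) = F(0) - \beta F'(0) + \frac{\beta^2}{2} F''(0) - \frac{\beta^3}{6} F^{(3)}(0) + \frac{\beta^4}{24} F^{(4)}(\beta^*), \tag{76} $$

$$ F(1-\beta) = F(0) + (1-\beta) F'(0) + \frac{(1-\beta)^2}{2} F''(0) + \frac{(1-\beta)^3}{6} F^{(3)}(0) + \frac{(1-\beta)^4}{24} F^{(4)}(\beta^{**}). \tag{77} $$

Using $F(0) = 0$ and eqs. (73) and (75), it follows that

$$ C_vL(n) = -F'(0) + \frac{2\beta - 1}{2} F''(0) - \frac{3\beta^2 - 3\beta + 1}{6} F^{(3)}(0) + O_p\!\left( \frac{1}{n^2} \right). \tag{78} $$

Thus, we have proven the first half of the theorem. For the latter half, by the
definitions of $\mathrm{WAIC}(n)$, the Bayes training loss, and the functional
variance, we have

$$ \mathrm{WAIC}(n) = B_tL(n) + (\beta/n) V(n), \tag{79} $$

$$ B_tL(n) = -F(1), \tag{80} $$

$$ V(n) = n F''(0). \tag{81} $$

Therefore,

$$ \mathrm{WAIC}(n) = -F(1) + \beta F''(0). \tag{82} $$

By Taylor expansion of $F(1)$, we obtain

$$ \mathrm{WAIC}(n) = -F'(0) + \frac{2\beta - 1}{2} F''(0) - \frac{1}{6} F^{(3)}(0) + O_p\!\left( \frac{1}{n^2} \right), \tag{83} $$

which completes the proof. (Q.E.D.)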

From the above theorem, we obtain the following corollary.

Corollary 1. For arbitrary $0 < \beta < \infty$, the cross-validation loss $C_vL(n)$
and the widely applicable information criterion $\mathrm{WAIC}(n)$ satisfy

$$ C_vL(n) = \mathrm{WAIC}(n) + O_p\!\left( \frac{1}{n^{3/2}} \right). \tag{84} $$

In particular, for $\beta = 1$,

$$ C_vL(n) = \mathrm{WAIC}(n) + O_p\!\left( \frac{1}{n^2} \right). \tag{85} $$

More precisely, the difference between the cross-validation loss and the widely
applicable information criterion is given by

$$ C_vL(n) - \mathrm{WAIC}(n) \cong \left( \frac{\beta - \beta^2}{2} \right) Y_3(n). \tag{86} $$

If $\beta = 1$,

$$ C_vL(n) - \mathrm{WAIC}(n) \cong \frac{1}{12} Y_4(n). \tag{87} $$
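
The leading coefficients of the difference between the cross-validation loss and WAIC
can be checked by subtracting the Taylor expansions of the generating function term by
term. The worked computation below is a sketch that extends the expansions used in the
proof of Theorem 1 by one order and treats the series formally, assuming that the terms
beyond the fourth order are of smaller order.

```latex
\begin{align*}
C_vL(n) &= F(-\beta) - F(1-\beta)
  = \sum_{k \ge 1} \frac{(-\beta)^k - (1-\beta)^k}{k!}\, Y_k(n), \\
\mathrm{WAIC}(n) &= -F(1) + \beta F''(0)
  = -\sum_{k \ge 1} \frac{1}{k!}\, Y_k(n) + \beta\, Y_2(n), \\
C_vL(n) - \mathrm{WAIC}(n)
  &= \sum_{k \ge 3} \frac{(-\beta)^k - (1-\beta)^k + 1}{k!}\, Y_k(n) \\
  &= \frac{\beta - \beta^2}{2}\, Y_3(n)
   + \frac{\beta^4 - (1-\beta)^4 + 1}{24}\, Y_4(n) + \cdots .
\end{align*}
```

The first and second order terms cancel for every $\beta$. At $\beta = 1$ the third
order coefficient vanishes and the fourth order coefficient equals $2/24 = 1/12$.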

4.3 Generalization Error and Cross-validation Error 

In the previous subsection, we have shown that the cross-validation loss is
asymptotically equivalent to the widely applicable information criterion. In this
section, let us compare the Bayes generalization error $B_g(n)$ given in eq. (20) and
the cross-validation error $C_v(n)$, which is defined as

$$ C_v(n) = C_vL(n) - L_n. \tag{88} $$

We need two mathematical concepts, the real log canonical threshold and the singular
fluctuation.

Definition. The zeta function $\zeta(z)$ $(\mathrm{Re}(z) > 0)$ of statistical learning
is defined as

$$ \zeta(z) = \int K(w)^z \varphi(w)\, dw, \tag{89} $$

where

$$ K(w) = E_X[f(X, w)] \tag{90} $$

is a nonnegative analytic function. Here, $\zeta(z)$ can be analytically continued to
the unique meromorphic function on the entire complex plane $\mathbb{C}$. All poles of
$\zeta(z)$ are real, negative, and rational numbers. The maximum pole is denoted as

$$ (-\lambda) = \text{maximum pole of } \zeta(z). \tag{91} $$

Then, the positive rational number $\lambda$ is referred to as the real log canonical
threshold. The singular fluctuation is defined as

$$ \nu = \nu(\beta) = \lim_{n \to \infty} \frac{\beta}{2} E[V(n)]. \tag{92} $$

Note that the real log canonical threshold does not depend on $\beta$, whereas the
singular fluctuation is a function of $\beta$.

Both the real log canonical threshold and the singular fluctuation are birational
invariants. In other words, they are determined by the algebraic geometrical structure
of the statistical model. The following lemma was proven in a previous study
[Watanabe 10a] [Watanabe 10b] [Watanabe 10d].

Lemma 3. The following convergences hold:

$$ \lim_{n \to \infty} n E[B_g(n)] = \frac{\lambda - \nu}{\beta} + \nu, \tag{93} $$

$$ \lim_{n \to \infty} n E[B_t(n)] = \frac{\lambda - \nu}{\beta} - \nu. \tag{94} $$

Moreover, convergence in probability

$$ n (B_g(n) + B_t(n)) + V(n) \to \frac{2\lambda}{\beta} \tag{95} $$

holds.

(Proof) For the case in which $q(x)$ is realizable by and singular for $p(x|w)$,
eqs. (93) and (94) were proven in Corollary 3 in [Watanabe 10a]. Equation (95) was
given in Corollary 2 in [Watanabe 10a]. For the case in which $q(x)$ is regular for
$p(x|w)$, these results were proven in [Watanabe 10b]. For the case in which $q(x)$ is
singular for and unrealizable by $p(x|w)$, they were generalized in [Watanabe 10d].
(Q.E.D.)

Examples. If $q(x)$ is regular for and realizable by $p(x|w)$, then
$\lambda = \nu = d/2$, where $d$ is the dimension of the parameter space. If $q(x)$ is
regular for and unrealizable by $p(x|w)$, then $\lambda$ and $\nu$ are given by
[Watanabe 10b]. If $q(x)$ is singular for and realizable by $p(x|w)$, then $\lambda$
for several models are obtained by resolution of singularities [Aoyagi & Watanabe 05]
[Rusakov & Geiger 05] [Yamazaki & Watanabe 03] [Lin 10] [Zwiernik 10]. If $q(x)$ is
singular for and unrealizable by $p(x|w)$, then $\lambda$ and $\nu$ remain unknown
constants.
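
As a worked check of Lemma 3 in the simplest setting, substituting the regular and
realizable values $\lambda = \nu = d/2$ into eqs. (93) and (94) gives the following.
This is only a substitution, not an additional result.

```latex
\begin{align*}
\lambda = \nu = \frac{d}{2}
\;\Longrightarrow\;
\lim_{n \to \infty} n E[B_g(n)] &= \frac{\lambda - \nu}{\beta} + \nu = \frac{d}{2}, \\
\lim_{n \to \infty} n E[B_t(n)] &= \frac{\lambda - \nu}{\beta} - \nu = -\frac{d}{2}.
\end{align*}
```

In this case the limits do not depend on $\beta$, and the asymptotic gap between the
average generalization and training errors is $d/n$, which matches the correction term
of AIC.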

We have the following theorem. 

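Theorem 2. The following equation holds:

$$ \lim_{n \to \infty} n E[C_v(n)] = \frac{\lambda - \nu}{\beta} + \nu. \tag{96} $$

The sum of the Bayes generalization error and the cross-validation error satisfies

$$ B_g(n) + C_v(n) = (\beta - 1) \frac{V(n)}{n} + \frac{2\lambda}{\beta n} + o_p\!\left( \frac{1}{n} \right). \tag{97} $$

In particular, if $\beta = 1$,

$$ B_g(n) + C_v(n) = \frac{2\lambda}{n} + o_p\!\left( \frac{1}{n} \right). \tag{98} $$

(Proof) By eq. (93),

$$ E[B_g(n-1)] = \left( \frac{\lambda - \nu}{\beta} + \nu \right) \frac{1}{n} + o\!\left( \frac{1}{n} \right). \tag{99} $$

Since $E[C_v(n)] = E[B_g(n-1)]$,

$$ \lim_{n \to \infty} n E[C_v(n)] = \lim_{n \to \infty} n E[B_g(n-1)] \tag{100} $$

$$ = \frac{\lambda - \nu}{\beta} + \nu. \tag{101} $$

From eq. (95) and Corollary 1,

$$ B_t(n) = C_v(n) - \frac{\beta}{n} V(n) + O_p\!\left( \frac{1}{n^{3/2}} \right), \tag{102} $$

and it follows that

$$ B_g(n) + C_v(n) = (\beta - 1) \frac{V(n)}{n} + \frac{2\lambda}{\beta n} + o_p\!\left( \frac{1}{n} \right), \tag{103} $$

which proves the theorem. (Q.E.D.)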

This theorem indicates that both the cross-validation error and the Bayes
generalization error are determined by the algebraic geometrical structure of the
statistical model, which is extracted as the real log canonical threshold. From this
theorem, in the strict Bayes case $\beta = 1$, we have

$$ E[B_g(n)] = \frac{\lambda}{n} + o\!\left( \frac{1}{n} \right), \tag{104} $$

$$ E[C_v(n)] = \frac{\lambda}{n} + o\!\left( \frac{1}{n} \right), \tag{105} $$

$$ B_g(n) + C_v(n) = \frac{2\lambda}{n} + o_p\!\left( \frac{1}{n} \right). \tag{106} $$

Therefore, a smaller cross-validation error $C_v(n)$ is equivalent to a larger Bayes
generalization error $B_g(n)$. Note that a regular statistical model is a special
example of singular models, hence both Theorems 1 and 2 also hold in regular
statistical models. In [Watanabe 09], it was proven that the random variable
$nB_g(n)$ converges to a random variable in law. Thus, $nC_v(n)$ converges to a random
variable in law. The asymptotic probability distribution of $nB_g(n)$ can be
represented using a Gaussian process, which is defined on the set of true parameters,
but is not equal to the $\chi^2$ distribution in general.

Remark. The relation given by eq. (106) indicates that, if $\beta = 1$, the variances
of $B_g(n)$ and $C_v(n)$ are equal. If the average value $2\nu = E[V(n)]$ is known,
then $B_t(n) + 2\nu/n$ can be used instead of $C_v(n)$, because both average values are
asymptotically equal to the Bayes generalization error. The variance of
$B_t(n) + 2\nu/n$ is smaller than that of $C_v(n)$ if and only if the variance of
$B_t(n)$ is smaller than that of $B_g(n)$. If a true distribution is regular for and
realizable by the statistical model, then the variance of $B_t(n)$ is asymptotically
equal to that of $B_g(n)$. However, in other cases, the variance of $B_t(n)$ may be
smaller or larger than that of $B_g(n)$.
5 Discussion

Let us now discuss the results of the present paper. 



5.1 From Regular to Singular 

First, we summarize the regular and singular learning theories. 

In regular statistical models, the generalization loss of the maximum likelihood 
method is asymptotically equal to that of the Bayes estimation. In both the maxi- 
mum likelihood and Bayes methods, the cross-validation losses have the same asymp- 
totic behaviors. The leave-one-out cross-validation is asymptotically equivalent to 
the AIC, in both the maximum likelihood and Bayes methods. 

On the other hand, in singular learning machines, the generalization loss of the
maximum likelihood method is larger than the Bayes generalization loss. Since the
generalization loss of the maximum likelihood method is determined by the maximum value
of the Gaussian process, the maximum likelihood method is not appropriate in singular
models [Watanabe 09]. In Bayes estimation, we derived the asymptotic expansion of the
generalization loss and proved that the average of the widely applicable information
criterion is asymptotically equal to the Bayes generalization loss [Watanabe 10a]. In
the present paper, we clarified that the leave-one-out cross-validation in Bayes
estimation is asymptotically equivalent to WAIC.

It was proven [Watanabe 01a] that the Bayes marginal likelihood of a singular model is
different from BIC of a regular model. In the future, we intend to compare the
cross-validation and Bayes marginal likelihood in model selection and hyperparameter
optimization in singular statistical models.

5.2 Cross-validation and Importance Sampling

Second, let us investigate the cross-validation and the importance sampling cross- 
validation from a practical viewpoint. 

In Theorem 1, we theoretically proved that the leave-one-out cross-validation is 
asymptotically equivalent to the widely applicable information criterion. In practi- 
cal applications, we often approximate the posterior distribution using the Markov 
Chain Monte Carlo or other numerical methods. If the posterior distribution is 
precisely realized, then the two theorems of the present paper hold. However, if 
the posterior distribution is not precisely approximated, then the cross-validation
might not be equivalent to the widely applicable information criterion. 

In Bayes estimation, there are two different methods by which the leave-one-out
cross-validation is numerically approximated. In the former method, $CV_1$ is obtained
by realizing all posterior distributions $E_w^{(i)}[\ ]$ leaving out $X_i$ for $i = 1, 2, 3, \ldots, n$, and
the empirical average

$$CV_1 = -\frac{1}{n} \sum_{i=1}^{n} \log E_w^{(i)}[p(X_i|w)] \tag{107}$$

is then calculated. In this method, we must realize $n$ different posterior distributions,
which requires heavy computational costs.

In the latter method, the posterior distribution leaving out $X_i$ is estimated using
the posterior average $E_w[\ ]$, in the same manner as eq. (7),

$$E_w^{(i)}[p(X_i|w)] = \frac{E_w[p(X_i|w)\, p(X_i|w)^{-\beta}]}{E_w[p(X_i|w)^{-\beta}]}. \tag{108}$$

This method is referred to as the importance sampling leave-one-out cross-validation
[Gelfand et al. 92], in which only one posterior distribution is needed and the leave-
one-out cross-validation is approximated by $CV_2$,

$$CV_2 = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{E_w[p(X_i|w)^{1-\beta}]}{E_w[p(X_i|w)^{-\beta}]}. \tag{109}$$

If the posterior distribution is completely realized, then $CV_1$ and $CV_2$ coincide
with each other and are asymptotically equivalent to the widely applicable informa-
tion criterion. However, if the posterior distribution is not sufficiently approximated,
then the values $CV_1$, $CV_2$, and $WAIC(n)$ might be different.
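As an illustration of the importance sampling approach, the following Python sketch computes the Bayes training loss, $CV_2$, and WAIC at $\beta = 1$ from a single set of posterior draws. It is a minimal sketch under stated assumptions: the one-dimensional normal model, the conjugate prior, the sample sizes, and all function names are hypothetical choices made for the demonstration and are not taken from the experiment of this paper. WAIC is computed here as the Bayes training loss plus the posterior variance of the log likelihood summed over the data and divided by $n$.

```python
import math
import random


def log_mean_exp(values):
    """Numerically stable log of the average of exp(v) over the list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def columns(loglik):
    """Yield, for each X_i, the list of log p(X_i|w_k) over posterior draws k."""
    for i in range(len(loglik[0])):
        yield [row[i] for row in loglik]


def bayes_training_loss(loglik):
    """B_t L(n) = -(1/n) sum_i log E_w[p(X_i|w)]."""
    n = len(loglik[0])
    return -sum(log_mean_exp(col) for col in columns(loglik)) / n


def importance_sampling_cv(loglik):
    """CV_2 at beta = 1, which reduces to (1/n) sum_i log E_w[1 / p(X_i|w)]."""
    n = len(loglik[0])
    return sum(log_mean_exp([-v for v in col]) for col in columns(loglik)) / n


def waic(loglik):
    """B_t L(n) plus (1/n) sum_i Var_w[log p(X_i|w)] (beta = 1)."""
    n = len(loglik[0])
    total_var = 0.0
    for col in columns(loglik):
        mean = sum(col) / len(col)
        total_var += sum((v - mean) ** 2 for v in col) / len(col)
    return bayes_training_loss(loglik) + total_var / n


def demo(seed=0, n=50, draws=2000):
    """Toy model (assumption): X ~ N(w, 1) with prior w ~ N(0, 10^2)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.5, 1.0) for _ in range(n)]
    precision = n + 1.0 / 100.0          # exact posterior precision of w
    post_mean = sum(xs) / precision
    post_sd = math.sqrt(1.0 / precision)
    ws = [rng.gauss(post_mean, post_sd) for _ in range(draws)]
    loglik = [[-0.5 * math.log(2 * math.pi) - 0.5 * (x - w) ** 2 for x in xs]
              for w in ws]
    return bayes_training_loss(loglik), importance_sampling_cv(loglik), waic(loglik)


bt, cv2, wa = demo()
print(f"B_t L(n) = {bt:.4f}, CV_2 = {cv2:.4f}, WAIC = {wa:.4f}")
```

Because the toy posterior is sampled exactly, $CV_2$ and WAIC should be close to each other here; with a poorly converged sampler the two values may drift apart, as noted above.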

The average values using the posterior distribution may sometimes have infinite
variances [Peruggia 97] if the set of parameters is not compact. Moreover, in sin-
gular learning machines, the set of true parameters is not a single point but rather
an analytic set, hence we must restrict the parameter space to be compact for well-
defined average values. Therefore, we adopted the assumptions in Subsection 2.3
that the parameter space is compact and the log likelihood function has the appro-
priate properties. Under these conditions, the observables studied in the present
paper have finite variances.



5.3 Comparison with the Deviance Information Criteria 

Third, let us compare the deviance information criterion (DIC) [Spiegelhalter et al. 02]
to the Bayes cross-validation and WAIC, because DIC is sometimes used in Bayesian
model evaluation. In order to estimate the Bayesian generalization error, DIC is
written as

$$DIC_1 = B_t L(n) + \frac{2}{n} \sum_{i=1}^{n} \left\{ -E_w[\log p(X_i|w)] + \log p(X_i|E_w[w]) \right\}, \tag{110}$$

where the second term of the right-hand side corresponds to the "effective number
of parameters" of DIC divided by the number of training samples. Under the condition
that the log likelihood ratio function in the posterior distribution is subject to the
$\chi^2$ distribution, a modified DIC was proposed [Gelman et al. 04] as



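$$DIC_2 = B_t L(n) + \frac{2}{n} \left[ E_w\!\left[ \left\{ \sum_{i=1}^{n} \log p(X_i|w) \right\}^{2} \right] - E_w\!\left[ \sum_{i=1}^{n} \log p(X_i|w) \right]^{2} \right], \tag{111}$$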
the variance of which was investigated previously [Raftery 07]. Note that $DIC_2$
is different from WAIC. In a singular learning machine, since the set of optimal
parameters is an analytic set, the correlation between different true parameters
does not vanish, even asymptotically.
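The two deviance information criteria can be evaluated from posterior draws as in the following Python sketch. It is only an illustration under assumptions: the toy model $X \sim \mathcal{N}(w, 1)$, the conjugate prior, and the names `log_density`, `dic_values`, and `demo` are hypothetical and do not appear in the original experiment. The first penalty uses the plug-in parameter $E_w[w]$, and the second uses the posterior variance of the total log likelihood.

```python
import math
import random


def log_mean_exp(values):
    """Numerically stable log of the average of exp(v) over the list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def log_density(x, w):
    """log p(x|w) for the toy model X ~ N(w, 1) (an assumption of this sketch)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - w) ** 2


def dic_values(xs, ws):
    """Return (B_t L(n), DIC_1, DIC_2) from data xs and posterior draws ws."""
    n, k = len(xs), len(ws)
    bt = -sum(log_mean_exp([log_density(x, w) for w in ws]) for x in xs) / n

    # DIC_1: the plug-in term is evaluated at the posterior mean E_w[w].
    w_bar = sum(ws) / k
    penalty1 = 0.0
    for x in xs:
        mean_loglik = sum(log_density(x, w) for w in ws) / k
        penalty1 += -mean_loglik + log_density(x, w_bar)
    dic1 = bt + 2.0 * penalty1 / n

    # DIC_2: posterior variance of the total log likelihood sum_i log p(X_i|w).
    totals = [sum(log_density(x, w) for x in xs) for w in ws]
    mean_total = sum(totals) / k
    var_total = sum((t - mean_total) ** 2 for t in totals) / k
    dic2 = bt + 2.0 * var_total / n
    return bt, dic1, dic2


def demo(seed=0, n=50, draws=4000):
    """Exact posterior sampling for the conjugate prior w ~ N(0, 10^2)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.5, 1.0) for _ in range(n)]
    precision = n + 1.0 / 100.0
    post_mean = sum(xs) / precision
    ws = [rng.gauss(post_mean, math.sqrt(1.0 / precision)) for _ in range(draws)]
    return dic_values(xs, ws)


bt, dic1, dic2 = demo()
print(f"DIC_1 - B_t L(n) = {dic1 - bt:.4f}, DIC_2 - B_t L(n) = {dic2 - bt:.4f}")
```

In this regular and realizable toy case with one parameter, both penalty terms should come out near $1/n$. The plug-in term `log_density(x, w_bar)` is the part that may become unreliable when the set of optimal parameters is not a single point.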

We first derive the theoretical properties of DIC. If the true distribution is regu-
lar for the statistical model, then the set of optimal parameters is a single point
$w_0$. Thus, the difference between $E_w[w]$ and the maximum a posteriori estimator is
asymptotically smaller than $1/\sqrt{n}$. Therefore, based on the results in [Watanabe 10b], if
$\beta = 1$,

$$E[DIC_1] = L_0 + (3\lambda - 2\nu(1)) \frac{1}{n} + o\!\left(\frac{1}{n}\right). \tag{112}$$

If the true distribution is realizable by or regular for the statistical model and if
$\beta = 1$, then the asymptotic behavior of $DIC_2$ is given by

$$E[DIC_2] = L_0 + (3\lambda - 2\nu(1) + 2\nu'(1)) \frac{1}{n} + o\!\left(\frac{1}{n}\right), \tag{113}$$

where $\nu'(1) = (d\nu/d\beta)(1)$. Equation (113) is derived from the relations [Watanabe 09]
[Watanabe 10a] [Watanabe 10b] [Watanabe 10d],

$$DIC_2 = B_t L(n) - 2 \frac{\partial}{\partial\beta} G_t L(n), \tag{114}$$

$$E[G_t L(n)] = L_0 + \left(\frac{\lambda}{\beta} - \nu(\beta)\right) \frac{1}{n} + o\!\left(\frac{1}{n}\right), \tag{115}$$

where $G_t L(n)$ is given by eq. (9).
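The step from the two relations above to the asymptotic form of $E[DIC_2]$ can be written out as follows. This is a sketch rather than a proof. It assumes that the derivative in $\beta$ can be exchanged with the expectation and with the $o(1/n)$ remainder, that $\lambda$ does not depend on $\beta$, and that the average Bayes training loss at $\beta = 1$ is $E[B_t L(n)] = L_0 + (\lambda - 2\nu(1))/n + o(1/n)$, as given by the earlier results cited in this paper.

```latex
\begin{align*}
\frac{\partial}{\partial\beta} E[G_t L(n)]
  &= \left(-\frac{\lambda}{\beta^{2}} - \nu'(\beta)\right)\frac{1}{n}
     + o\!\left(\frac{1}{n}\right), \\
-2\,\frac{\partial}{\partial\beta} E[G_t L(n)]\Big|_{\beta=1}
  &= \bigl(2\lambda + 2\nu'(1)\bigr)\frac{1}{n} + o\!\left(\frac{1}{n}\right), \\
E[DIC_2]
  &= E[B_t L(n)] - 2\,\frac{\partial}{\partial\beta} E[G_t L(n)]\Big|_{\beta=1} \\
  &= L_0 + \bigl(\lambda - 2\nu(1)\bigr)\frac{1}{n}
     + \bigl(2\lambda + 2\nu'(1)\bigr)\frac{1}{n} + o\!\left(\frac{1}{n}\right) \\
  &= L_0 + \bigl(3\lambda - 2\nu(1) + 2\nu'(1)\bigr)\frac{1}{n}
     + o\!\left(\frac{1}{n}\right).
\end{align*}
```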

Next, let us consider the DIC for each case. If the true distribution is regular for
and realizable by the statistical model and if $\beta = 1$, then $\lambda = \nu = d/2$ and $\nu'(1) = 0$,
where $d$ is the number of parameters. Thus, their averages are asymptotically equal
to the Bayes generalization error,

$$E[DIC_1] = L_0 + \frac{d}{2n} + o\!\left(\frac{1}{n}\right), \tag{116}$$

$$E[DIC_2] = L_0 + \frac{d}{2n} + o\!\left(\frac{1}{n}\right). \tag{117}$$

In this case, the averages of $DIC_1$, $DIC_2$, $CV_1$, $CV_2$, and WAIC have the same
asymptotic behavior.

If the true distribution is regular for and unrealizable by the statistical model
and if $\beta = 1$, then $\lambda = d/2$, $\nu = \frac{1}{2}\mathrm{tr}(IJ^{-1})$, and $\nu'(1) = 0$ [Watanabe 10b], where
$I$ is the Fisher information matrix at $w_0$, and $J$ is the Hessian matrix of $L(w)$ at
$w = w_0$. Thus, we have

$$E[DIC_1] = L_0 + \left(\frac{3d}{2} - \mathrm{tr}(IJ^{-1})\right) \frac{1}{n} + o\!\left(\frac{1}{n}\right), \tag{118}$$

$$E[DIC_2] = L_0 + \left(\frac{3d}{2} - \mathrm{tr}(IJ^{-1})\right) \frac{1}{n} + o\!\left(\frac{1}{n}\right). \tag{119}$$

In this case, as shown in Lemma 3, the Bayes generalization error is given by $L_0 +
d/(2n)$ asymptotically, and so the averages of the deviance information criteria are
not equal to the average of the Bayes generalization error.
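The size of the discrepancy in the regular and unrealizable case follows by subtracting the two asymptotic forms. The short computation below is a sketch that uses only the expressions stated above; the remark about $I = J$ relies on the standard fact that the two matrices coincide when the true distribution is realizable.

```latex
\begin{align*}
E[DIC_1] - E[B_g L(n)]
  &= \left(\frac{3d}{2} - \mathrm{tr}(IJ^{-1})\right)\frac{1}{n}
     - \frac{d}{2n} + o\!\left(\frac{1}{n}\right) \\
  &= \frac{d - \mathrm{tr}(IJ^{-1})}{n} + o\!\left(\frac{1}{n}\right).
\end{align*}
```

The leading term vanishes only when $\mathrm{tr}(IJ^{-1}) = d$, for example when $I = J$, which recovers the realizable case; the same difference is obtained for $DIC_2$.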

If the true distribution is singular for and realizable by the statistical model and
if $\beta = 1$, then

$$E[DIC_1] = C + o(1), \tag{120}$$

$$E[DIC_2] = L_0 + (3\lambda - 2\nu(1) + 2\nu'(1)) \frac{1}{n} + o\!\left(\frac{1}{n}\right), \tag{121}$$

where $C$ ($C \neq L_0$) is, in general, a constant. Equation (120) is obtained because the
set of true parameters in a singular model is not a single point, but rather an analytic
set, so that, in general, the average $E_w[w]$ is not contained in the neighborhood of
the set of the true parameters. Hence the averages of the deviance information
criteria are not equal to those of the Bayes generalization error.

The averages of the cross-validation loss and WAIC have the same asymptotic 
behavior as that of the Bayes generalization error, even if the true distribution 
is unrealizable by or singular for the statistical model. Therefore, the deviance 
information criteria are different from the cross-validation and WAIC, if the true 
distribution is singular for or unrealizable by the statistical model. 



5.4 Experiment 

In this section, we describe an experiment. The purpose of the present paper is to 
clarify the theoretical properties of the cross-validation and the widely applicable 
information criterion. An experiment was conducted in order to illustrate the main 
theorems. 

Let $x, y \in \mathbb{R}^3$. We considered a statistical model defined as

$$p(x, y|w) = \frac{s(x)}{(2\pi\sigma^2)^{3/2}} \exp\left(-\frac{\|y - R_H(x, w)\|^2}{2\sigma^2}\right), \tag{122}$$

where $\sigma = 0.1$ and $s(x)$ is $\mathcal{N}(0, 2^2 I)$. Here, $\mathcal{N}(m, A)$ denotes a normal distribution
with the average vector $m$ and the covariance matrix $A$, and $I$ is the identity matrix.
Note that the distribution $s(x)$ was not estimated. We used a three-layered neural
network,

$$R_H(x, w) = \sum_{h=1}^{H} a_h \tanh(b_h \cdot x), \tag{123}$$

where the parameter was

$$w = \{(a_h \in \mathbb{R}^3, b_h \in \mathbb{R}^3) ;\ h = 1, 2, \ldots, H\} \in \mathbb{R}^{6H}. \tag{124}$$
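The network and the Gaussian factor of the model can be sketched in Python as follows. The function names and the particular parameter values are hypothetical and serve only as an illustration. The sketch also shows why a larger network can represent a smaller one with many different parameters: hidden units with $a_h = 0$ contribute nothing, whatever $b_h$ is.

```python
import math


def network(x, a, b):
    """R_H(x, w) = sum_h a_h * tanh(b_h . x); a and b each hold H vectors in R^3."""
    out = [0.0, 0.0, 0.0]
    for a_h, b_h in zip(a, b):
        t = math.tanh(sum(bj * xj for bj, xj in zip(b_h, x)))
        for j in range(3):
            out[j] += a_h[j] * t
    return out


def log_conditional_density(x, y, a, b, sigma=0.1):
    """log of p(x, y|w) / s(x), i.e. only the Gaussian factor in y."""
    r = network(x, a, b)
    sq = sum((yj - rj) ** 2 for yj, rj in zip(y, r))
    return -1.5 * math.log(2 * math.pi * sigma ** 2) - sq / (2 * sigma ** 2)


# A hypothetical parameter with one hidden unit (H = 1).
a_small = [[1.0, -0.5, 0.3]]
b_small = [[0.2, 0.4, -0.1]]

# Two different H = 3 parameters realizing the same function:
# the extra units have a_h = 0, so their b_h values are arbitrary.
a_big = a_small + [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b_big_1 = b_small + [[5.0, 1.0, 2.0], [-3.0, 0.0, 7.0]]
b_big_2 = b_small + [[0.0, 0.0, 0.0], [9.0, 9.0, 9.0]]

x = [0.3, -1.2, 0.8]
print(network(x, a_small, b_small))
print(network(x, a_big, b_big_1))
print(network(x, a_big, b_big_2))
```

All three calls print the same vector, so the parameters of the larger network that reproduce the smaller one form a set rather than a single point.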

In the experiment, a learning machine with $H = 3$ was used and the true distribution
was set with $H = 1$. The parameter that gives the distribution is denoted as
$w_0$, which denotes the parameters of both models $H = 1, 3$. Then, $R_H(x, w_0) =
R_{H_0}(x, w_0)$ with $H = 3$ and $H_0 = 1$. Under this condition, the set of true parameters

$$\{w \in W;\ p(x|w) = p(x|w_0)\} \tag{125}$$

is not a single point but an analytic set with singularities, so that the regular-
ity condition is not satisfied. In this case, the log density ratio function is equivalent
to

$$f(x, y, w) = \frac{1}{2\sigma^2} \left\{ \|y - R_H(x, w)\|^2 - \|y - R_H(x, w_0)\|^2 \right\}. \tag{126}$$

In this model, although the Bayes generalization error is not equal to the average
square error

$$SE(n) = \frac{1}{2\sigma^2} E\, E_X\!\left[ \| R_H(X, w_0) - E_w[R_H(X, w)] \|^2 \right], \tag{127}$$

asymptotically $SE(n)$ and $B_g(n)$ are equal to each other [Watanabe 09].

The prior distribution $\varphi(w)$ was set as $\mathcal{N}(0, 10^2 I)$. Although this prior does not
have compact support mathematically, it can be understood in the experiment that
the support of $\varphi(w)$ is essentially contained in a sufficiently large compact set.

In the experiment, the number of training samples was fixed as $n = 200$. One
hundred sets of 200 training samples each were obtained independently. For each
training set, the strict Bayes posterior distribution ($\beta = 1$) was approximated by the
Markov chain Monte Carlo (MCMC) method. The Metropolis method, in which
each random trial was taken from $\mathcal{N}(0, (0.005)^2 I)$, was applied, and the average
exchanging (acceptance) ratio was approximately 0.35. After 100,000 iterations of
Metropolis random sampling, 200 parameters were obtained, one in every 100 sampling
steps. For a fixed training set, by changing the initial values and the random seeds
of the software, the same MCMC sampling procedures were performed 10 times
independently, which was done for the purpose of minimizing the effect of the local
minima. Finally, for each training set, we obtained 200 × 10 = 2,000 parameters,
which were used to approximate the posterior distribution.
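A random-walk Metropolis sampler with an initial run followed by thinning can be sketched as below. This is not the software used in the experiment: the function `metropolis`, its arguments, and the two-dimensional standard normal target are hypothetical, and the reading of the procedure as "discard an initial run, then keep one parameter every fixed number of steps" is an assumption.

```python
import math
import random


def metropolis(log_target, init, step_sd, burn_in, n_keep, thin, seed=0):
    """Random-walk Metropolis; returns (kept samples, acceptance ratio)."""
    rng = random.Random(seed)
    current = list(init)
    current_lp = log_target(current)
    kept, accepted = [], 0
    total = burn_in + n_keep * thin
    for t in range(1, total + 1):
        # Gaussian random trial around the current parameter.
        proposal = [c + rng.gauss(0.0, step_sd) for c in current]
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
            accepted += 1
        # After the initial run, keep one parameter every `thin` steps.
        if t > burn_in and (t - burn_in) % thin == 0:
            kept.append(list(current))
    return kept, accepted / total


def log_std_normal(w):
    """Unnormalized log density of a standard normal (toy target)."""
    return -0.5 * sum(c * c for c in w)


samples, acceptance = metropolis(log_std_normal, [0.0, 0.0], step_sd=1.0,
                                 burn_in=1000, n_keep=200, thin=10)
print(len(samples), round(acceptance, 2))
```

Under the same reading, the setting described above would correspond to `step_sd=0.005`, `burn_in=100000`, `n_keep=200`, and `thin=100`, repeated with 10 different seeds and initial values, with `log_target` given by the log posterior of the neural network model.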

Table 2 shows the experimental results. We observed the Bayes generalization
error BG = $B_g(n)$, the Bayes training error BT = $B_t(n)$, the importance sampling leave-
one-out cross-validation CV = $CV_2 - L_n$, the widely applicable information criterion
WAIC = $WAIC(n) - L_n$, two deviance information criteria, namely, DIC1 = $DIC_1 - L_n$
and DIC2 = $DIC_2 - L_n$, and the sum BG + CV = $B_g(n) + C_v(n)$. The
values AVR and STD in Table 2 show the average and standard deviation over one
hundred sets of training data, respectively. The original cross-validation $CV_1$ was
not observed because the associated computational cost was too high.

The experimental results reveal that the average and standard deviation of BG 
were approximately the same as those of CV and WAIC, which indicates that The- 
orem 1 holds. The real log canonical threshold, the singular fluctuation, and its 
        BG       BT        CV       WAIC     DIC1       DIC2     BG + CV
AVR     0.0264   -0.0511   0.0298   0.0278   -35.1077   0.0415   0.0562
STD     0.0120   0.0165    0.0137   0.0134   19.1350    0.0235   0.0071

Table 2: Average and standard deviation





        BG      BT      CV      WAIC    DIC1    DIC2    BG + CV
BG      1.000   -0.854  -0.854  -0.873  0.031   -0.327  0.043
BT              1.000   0.717   0.736   0.066   0.203   -0.060
CV                      1.000   0.996   -0.087  0.340   0.481
WAIC                            1.000   -0.085  0.341   0.443
DIC1                                    1.000   -0.069  -0.115
DIC2                                            1.000   0.102

Table 3: Correlation matrix

derivative of this case were estimated as

$$\lambda \approx 5.6, \tag{128}$$
$$\nu(1) \approx 7.9, \tag{129}$$
$$\nu'(1) \approx 3.6. \tag{130}$$

Note that, if the true distribution is regular for and realizable by the statistical
model, $\lambda = \nu(1) = d/2 = 9$ and $\nu'(1) = 0$. The averages of the two deviance
information criteria were not equal to that of the Bayes generalization error. The
standard deviation of BG + CV was smaller than the standard deviations of BG
and CV, which is in agreement with Theorem 2.
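The reported estimates can be checked against the averages in Table 2 with a few lines of arithmetic. The Python sketch below compares the observed averages with the asymptotic predictions $2\lambda/n$ for $B_g(n) + C_v(n)$, $2\nu/n$ for $C_v(n) - B_t(n)$, and $(3\lambda - 2\nu(1) + 2\nu'(1))/n$ for $DIC_2 - L_n$. The variable names are hypothetical, and the estimates may have been obtained from these same averages, in which case the agreement is a consistency check rather than an independent confirmation.

```python
n = 200
lam, nu, nu_prime = 5.6, 7.9, 3.6      # estimated values reported in the text

# Averages (AVR row) taken from Table 2.
bg, bt, cv, dic2, bg_plus_cv = 0.0264, -0.0511, 0.0298, 0.0415, 0.0562

checks = {
    "B_g + C_v vs 2*lambda/n": (bg_plus_cv, 2 * lam / n),
    "C_v - B_t vs 2*nu/n": (cv - bt, 2 * nu / n),
    "DIC_2 - L_n vs (3*lambda - 2*nu + 2*nu')/n":
        (dic2, (3 * lam - 2 * nu + 2 * nu_prime) / n),
}
for name, (observed, predicted) in checks.items():
    print(f"{name}: observed {observed:.4f}, predicted {predicted:.4f}")
```

The three pairs agree to within a few thousandths, which is small compared with the standard deviations listed in Table 2.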

Note that the standard deviation of BT was larger than those of CV and WAIC,
which indicates that, even if the average value $E[C_v(n) - B_t(n)] = 2\nu/n$ is known
and an alternative cross-validation analogous to the AIC,

$$CV_3 = B_t L(n) + 2\nu/n, \tag{131}$$

is used, the variance of $CV_3 - L_n$ is larger than the variances of $C_v L(n) - L_n$
and $WAIC(n) - L_n$.

Table 3 shows the correlation matrix for several values. The correlation between 
CV and WAIC was 0.996, which indicates that Theorem 1 holds. The correlation 
between BG and CV was -0.854, and that between BG and WAIC was -0.873, which 
corresponds to Theorem 2. 

The accuracy of numerical approximation of the posterior distribution depends 
on the statistical model, the true distribution, the prior distribution, the Markov 
chain Monte Carlo method, and the experimental fluctuation. In the future, we 
intend to develop a method by which to design experiments. The theorems proven 
in the present paper may be useful in such research. 
5.5 Birational Invariant



Finally, we investigate the statistical problem from an algebraic geometrical view- 
point. 

In Bayes estimation, we can introduce an analytic function of the parameter
space $g : U \to W$,

$$w = g(u). \tag{132}$$

Let $|g'(u)|$ be its Jacobian determinant. Note that the inverse function $g^{-1}$ is not
needed if $g$ satisfies the condition that $\{u \in U;\ |g'(u)| = 0\}$ is a measure zero set in
$U$. Such a function $g$ is referred to as a birational transform. It is important that,
by the transform,

$$p(x|w) \mapsto p(x|g(u)), \tag{133}$$
$$\varphi(w) \mapsto \varphi(g(u))\,|g'(u)|, \tag{134}$$

the Bayes estimation on $W$ is equivalent to that on $U$. A constant defined for a set
of statistical models and a prior is said to be a birational invariant if it is invariant
under such a transform $w = g(u)$.
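The equivalence of Bayes estimation on $W$ and on $U$ can be checked numerically in a small example. The Python sketch below is an illustration under assumptions that are not part of the paper: a one-dimensional model $p(x|w) = \mathcal{N}(x; w, 1)$, an unnormalized prior weight $\mathcal{N}(w; 0, 1)$ on $W = [-8, 8]$, a single hypothetical observation, and the map $g(u) = u^3$ on $U = [-2, 2]$, whose derivative vanishes only at $u = 0$. It compares the marginal likelihood integral computed in the two coordinates.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def trapezoid(f, lo, hi, steps):
    """Composite trapezoidal rule for the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h


x_obs = 0.7                                      # hypothetical observation


def integrand_w(w):
    """p(x|w) * phi(w) in the original coordinate w."""
    return normal_pdf(x_obs, w, 1.0) * normal_pdf(w, 0.0, 1.0)


def g(u):
    """Analytic map with g'(u) = 3u^2, which is zero only at u = 0."""
    return u ** 3


def integrand_u(u):
    """p(x|g(u)) * phi(g(u)) * |g'(u)| in the transformed coordinate u."""
    return integrand_w(g(u)) * abs(3 * u ** 2)


z_w = trapezoid(integrand_w, -8.0, 8.0, 20000)   # integral over W = [-8, 8]
z_u = trapezoid(integrand_u, -2.0, 2.0, 20000)   # integral over U, g(U) = W
z_exact = normal_pdf(x_obs, 0.0, math.sqrt(2.0)) # closed form on the whole line
print(z_w, z_u, z_exact)
```

The two integrals agree up to numerical error even though $g'(0) = 0$, because the set where the Jacobian vanishes has measure zero. The closed-form value differs from both only by the negligible mass outside $[-8, 8]$.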

The real log canonical threshold $\lambda$ is a birational invariant [Atiyah 70] [Hironaka 64]
[Kashiwara 76] [Kollár et al. 98] [Mustata 02] [Watanabe 09] that represents the alge-
braic geometrical relation between the set of parameters $W$ and the set of the optimal
parameters $W_0$. Although the singular fluctuation is also a birational invariant, its
properties remain unknown. In the present paper, we proved in Theorem 1 that

$$E[B_g L(n)] = E[C_v L(n)] + o(1/n). \tag{135}$$

On the other hand, in Theorem 2, we proved that

$$B_g(n) + C_v(n) = \frac{2\lambda}{n} + o_p(1/n). \tag{136}$$

In model selection or hyperparameter optimization, eq. (135) shows that mini-
mization of the cross-validation makes the generalization loss smaller on average.
However, eq. (136) shows that minimization of the cross-validation does not ensure
minimum generalization loss. The widely applicable information criterion has the
same property as the cross-validation. The constant $\lambda$ appears to exhibit a bound,
which can be attained by statistical estimation for a given pair of a statistical model
and a prior distribution. Hence, clarification of the algebraic geometrical structure
in statistical estimation is an important problem in statistical learning theory.
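To see why the result of Theorem 2 rules out a guarantee of minimum generalization loss, it may help to rearrange the relation $B_g(n) + C_v(n) = 2\lambda/n + o_p(1/n)$. The manipulation below is a heuristic sketch: it assumes that the remainder term can be neglected when taking variances and covariances, which is not proven here.

```latex
\begin{align*}
n\,C_v(n) &= 2\lambda - n\,B_g(n) + o_p(1), \\
\mathrm{Var}\bigl[n\,C_v(n)\bigr] &\approx \mathrm{Var}\bigl[n\,B_g(n)\bigr], \\
\mathrm{Cov}\bigl[n\,B_g(n),\, n\,C_v(n)\bigr] &\approx -\mathrm{Var}\bigl[n\,B_g(n)\bigr].
\end{align*}
```

Under this reading, for a fixed model and prior, a training set that yields a small cross-validation error tends to yield a large generalization error, which is consistent with the negative correlation between BG and CV reported in Table 3.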



6 Conclusion 

In the present paper, we have shown theoretically that the leave-one-out cross- 
validation in Bayes estimation is asymptotically equal to the widely applicable in- 
formation criterion and that the sum of the cross-validation error and the general- 
ization error is equal to twice the real log canonical threshold divided by the number 
of training samples. In addition, we clarified that cross-validation and the widely
applicable information criterion are different from the deviance information criteria. 
This result indicates that, even in singular statistical models, the cross-validation 
is asymptotically equivalent to the information criterion, and that the asymptotic 
properties of these models are determined by the algebraic geometrical structure of 
a statistical model. 